Introduction

The following project will be analyzing data to be able to predict the profit percentage that the company StockX is proposed to make from the shoes that they resell. The data will be explored analytically to be able to understand the variance within and between variables. Through stratified sampling the data will be split and cross validation will be used as well to be able to fit into various machine learning models. After comparing models, the testing data will be fit to the model that produces the best RMSE and \(R^2\) value.

What is StockX?

StockX is a online marketplace that mainly focuses on reselling sneakers. As a consumer, you are not only able to buy but also sell your own pair of sneakers.

StockX is a third party platform that primarily buys and sells shoes. For people who are not familiar with the sneaker industry, it can be helpful to think of the StockX business as such: think of a shoe as a stock where different factors affect the price of the shoe at a certain time. For stocks, certain socioeconomic factors such as investor demand influence the price of stocks. If the number of buy orders outnumber the number of sell orders, stock prices rise, and vice versa. The same holds true for shoes, where different factors such as innovation or the name/brand associated with the shoe influence how valuable a shoe is within the shoe market. Organizations such as Yeezy and Offwhite (brands used in this data set) implement basic supply and demand techniques by limiting scale production to increase demand and value for their shoes and other items.

Libraries Used for this Project

library(readr)
library(tidyverse)
library(tidymodels)
library(ISLR)
library(rpart.plot)
library(vip)
library(janitor)
library(xgboost)
library(ISLR2) 
library(discrim)
library(poissonreg)
library(corrr)
library(corrplot)
library(klaR) 
library(pROC)
library(glmnet)
library(dplyr)
library(randomForest)
library(rpart)
library(ranger)
library(vip)
library(lubridate)
library(kernlab)
library(kknn)
tidymodels_prefer()

Read the data file into R. This file was a csv file. I found this file on the StockX website that challenged its audience in 2019 to enter a data contest using the same data that is used for this project. The link to the original website can be found here.

sneakers <- read.csv("Data/StockX-Data.csv")

Cleaning the Data

Next we will do some cleaning. After reading the file in I noticed that there were certain variables that were read in as characters, however they will need to be used as integers therefore we will be converting them to integers before we continue. Likewise the dates that are provided were read in as characters and we will have to switch them to be read as a date format.

# Clean names
sneakers <- sneakers %>%
  clean_names()

#Sale.Price from chr to int
sneakers$sale_price <- parse_number(sneakers$sale_price)
sneakers$sale_price <- as.integer(sneakers$sale_price)

# Retail.Price from chr to int
sneakers$retail_price <- parse_number(sneakers$retail_price)
sneakers$retail_price <- as.integer(sneakers$retail_price)

# Order.Date from chr to date
sneakers$order_date <- as.Date(sneakers$order_date, "%m/%d/%y")

# Release.Date from chr to date
sneakers$release_date <- as.Date(sneakers$release_date, "%m/%d/%y")

#Check for Missing Values in the dataset
sum(is.na(sneakers))
## [1] 0

There are no missing values within the data set, therefore we will continue to use the entire data set.

Adding additional information

There are a couple new variables that can be added to this data set that can further enhance the analysis. As of right now, the variables containing dates serve us no purpose however we can convert them to weeks relative to each other to be able to incorporate them into our analysis. Similarly, we can use the days of the week that each shoe was ordered to see if it influences the profit percentage in any way. Another very important variable that is missing in this data set is the profit that each shoe makes. This can easily be calculated by finding the difference between the sale price and the retail price of each shoe. Likewise the profit percentage can be calculated as well by dividing the profit by the retail price of each shoe.

# Add the number of weeks between the release date and the order date
sneakers$weeks <- round(difftime(sneakers$order_date, sneakers$release_date, units = "weeks"))

# Add in the day of the week that the shoe was ordered/purchased
sneakers$day_of_week <- format(as.Date(sneakers$order_date), "%A")

# Add how much profit was made on each sale
sneakers$profit <- sneakers$sale_price - sneakers$retail_price

# Add the percentage of the profit made
sneakers$profit_perc <- round((sneakers$profit / sneakers$retail_price) * 100)

Exploratory Data Analysis (EDA)

The entire exploratory data analysis will be done on the entire data set. The data set overall has 99,956 observations, with originally 8 variables, however after adding
new variables to the data set it actually has 12 variables. Each observation represents the sale of a single shoe.

Shoe Size

sneakers %>%
  count(shoe_size) %>%
  arrange(-n) %>%
  head()
##   shoe_size     n
## 1      10.0 11093
## 2       9.0  9706
## 3      11.0  9251
## 4      10.5  8784
## 5       9.5  8685
## 6      12.0  7297

There are 26 different shoe sizes in the dataset. The top three sizes across the entire data set are: * 10 (11093)

  • 9 (9706)

  • 11 (9251)

Brands

sneakers %>%
  select(brand) %>%
  distinct() %>%
  group_by() %>%
  head()
## # A tibble: 2 × 1
##   brand      
##   <chr>      
## 1 " Yeezy"   
## 2 "Off-White"

There are only two brands within the 99,956 different observations found in this dataset: Yeezy and Off-White.

Next we will be separating the two brands into their own data frames to be able to analyze them separately.

Off-White

# Create a new data frame for all the Off-White brand sneakers
offwhite <- sneakers %>%
  select(order_date, brand, sneaker_name, sale_price, retail_price, release_date, shoe_size, buyer_region, weeks, day_of_week, profit, profit_perc) %>%
  filter(brand == "Off-White")

offwhite %>%
  count(shoe_size) %>%
  arrange(-n) %>%
  head()
##   shoe_size    n
## 1      10.0 3654
## 2      11.0 3109
## 3       9.0 2703
## 4      10.5 2537
## 5       9.5 2255
## 6      12.0 2255

This Off-White dataset is missing the size 13.0 and 14.5, therefore there are only 24 sizes as opposed to 26. The top three sizes with the most shoes for Off-White are: * 10 (3654)

  • 11 (3109)

  • 9 (2703)

Yeezy

# Yeezy dataframe

# Clean brand names to not include spaces
sneakers$brand <- gsub(" ", "", sneakers$brand)

# Create a new data frame for all the Yeezy brand sneakers
yeezy <- sneakers %>%
  select(order_date, brand, sneaker_name, sale_price, retail_price, release_date, shoe_size, buyer_region, weeks, day_of_week, profit, profit_perc) %>%
  filter(brand == "Yeezy")

yeezy %>%
  count(shoe_size) %>%
  arrange(-n) %>%
  head()
##   shoe_size    n
## 1      10.0 7439
## 2       9.0 7003
## 3       9.5 6430
## 4      10.5 6247
## 5      11.0 6142
## 6      12.0 5042

This Yeezy dataset is missing the size 15, therefore there are only 25 sizes as opposed to 26. The top three sizes with the most shoes for Yeezy are: * 10 (7439)

  • 9 (7003)

  • 9.5 (6430)

Within each brand there are different types of sneakers. Next we will look through each of the data frames for both brands to see how many individual sneakers there are.

#Find the number of sneakers listed for Off-White and the quantities of each.
offwhite %>%
  count(sneaker_name) %>%
  arrange(-n)%>%
  head()
##                                        sneaker_name    n
## 1 Air-Jordan-1-Retro-High-Off-White-University-Blue 4635
## 2              Nike-Air-Presto-Off-White-Black-2018 1884
## 3              Nike-Air-Presto-Off-White-White-2018 1883
## 4                  Nike-Air-VaporMax-Off-White-2018 1591
## 5         Nike-Blazer-Mid-Off-White-All-Hallows-Eve 1435
## 6             Nike-Blazer-Mid-Off-White-Grim-Reaper 1398

For the Off-White brand there are 30 individual sneaker types, where the sneakers with the most sales are: * Air-Jordan-1-Retro-High-Off-White-University-Blue (4635)

  • Nike-Air-Presto-Off-White-Back-2018 (1884)

  • Nike-Air-Presto-Off-White-White-2018 (1883)

#Find the number of sneakers listed for Yeezy and the quantities of each.
yeezy %>%
  count(sneaker_name) %>%
  arrange(-n) %>%
  head()
##                            sneaker_name     n
## 1      adidas-Yeezy-Boost-350-V2-Butter 11423
## 2 Adidas-Yeezy-Boost-350-V2-Beluga-2pt0 10395
## 3       Adidas-Yeezy-Boost-350-V2-Zebra 10110
## 4   Adidas-Yeezy-Boost-350-V2-Blue-Tint  9297
## 5 Adidas-Yeezy-Boost-350-V2-Cream-White  9097
## 6      Adidas-Yeezy-Boost-350-V2-Sesame  5553

For the Yeezy brand there are 20 individual sneaker types, where the sneakers with the most sales are: * Adidas-Yeezy-Boost-350-V2-Butter (11423)

  • Adidas-Yeezy-Boost-350-V2-Beluga-2pt0 (10395)

  • Adidas-Yeezy-Boost-350-V2-Zebra (10110)

In total there are 50 different kinds of sneakers across 2 different brands, 30 for Off-White, and 20 for Yeezy.

Next we will look at sales for both the Off-White brand and the Yeezy brand, and compare them to the average retail price, average sale price, and max sale price for each shoe and their kind.

# Find average sale and retail price and max sale price for each sneaker, arranged by highest average sale price to lowest
offwhite %>%
  group_by(sneaker_name) %>%
  summarize(avg_retail_price=mean(retail_price), avg_sale_price=mean(sale_price), max_sale_price=max(sale_price)) %>%
  arrange(-avg_sale_price) %>%
  head()
## # A tibble: 6 × 4
##   sneaker_name                    avg_retail_price avg_sale_price max_sale_price
##   <chr>                                      <dbl>          <dbl>          <int>
## 1 Air-Jordan-1-Retro-High-Off-Wh…              190          1826.           2950
## 2 Air-Jordan-1-Retro-High-Off-Wh…              190          1770.           4050
## 3 Nike-Air-Presto-Off-White                    160          1236.           2160
## 4 Nike-Air-Force-1-Low-Virgil-Ab…              150           976.           1500
## 5 Nike-Air-Max-97-Off-White-Elem…              190           894.           1290
## 6 Nike-Air-VaporMax-Off-White                  250           857.           2399

Based on our output we conclude the follwoing:

The most expensive sales for Off-White shoes are: * Air-Jordan-1-Retro-High-Off-White-Chicago (4050)

  • Air-Jordan-1-Retro-High-Off-White-University-Blue (3680)

  • Air-Jordan-1-Retro-High-Off-White-White (2950)

The most expensive Off-White shoes on average for sales price are: * Air-Jordan-1-Retro-High-Off-White-White (1826.07)

  • Air-Jordan-1-Retro-High-Off-White-Chicago (1769)

  • Nike-Air-Presto-Off-White (1236.05)

# Find average sale and retail price and max sale price for each sneaker, arranged by highest average sale price to lowest
yeezy %>%
  group_by(sneaker_name) %>%
  summarize(avg_retail_price=mean(retail_price), avg_sale_price=mean(sale_price), max_sale_price=max(sale_price)) %>%
  arrange(-avg_sale_price) %>%
  head()
## # A tibble: 6 × 4
##   sneaker_name                    avg_retail_price avg_sale_price max_sale_price
##   <chr>                                      <dbl>          <dbl>          <int>
## 1 Adidas-Yeezy-Boost-350-Low-Tur…              200          1532.           2300
## 2 Adidas-Yeezy-Boost-350-Low-Oxf…              200          1012.           1470
## 3 Adidas-Yeezy-Boost-350-Low-Moo…              200           997.           2000
## 4 Adidas-Yeezy-Boost-350-Low-Pir…              200           984.           1455
## 5 Adidas-Yeezy-Boost-350-V2-Core…              220           938.           1575
## 6 Adidas-Yeezy-Boost-350-Low-Pir…              200           895.           1300

The most expenisive sales for Yeezy shoes are: * Adidas-Yeezy-Boost-350-Low-Turtledove (2300)

  • Adidas-Yeezy-Boost-350-Low-Moonrock (2000)

  • Adidas-Yeezy-Boost-350-V2-Blue-Tint (2000)

The most expensive Yeezy shoes on average for sales price are: * Adidas-Yeezy-Boost-350-Low-Turtledove (1531.66)

  • Adidas-Yeezy-Boost-350-Low-Oxford-Tan (1011.51)

  • Adidas-Yeezy-Boost-350-Low-Moonrock (996.71)

Next we will find out how many buyer regions there are and what are they:

sneakers %>% 
  count(buyer_region) %>%
  arrange(-n) %>%
  head()
##   buyer_region     n
## 1   California 19349
## 2     New York 16525
## 3       Oregon  7681
## 4      Florida  6376
## 5        Texas  5876
## 6   New Jersey  4720

There are a total of 51 regions where the shoes in the data set was sold, where the top regions with the most sales are: * California (19349)

  • New York (16525)

  • Oregon (7681)

The regions for the most part are states, however District of Colombia is considered a region in this data set, which is not a state but rather the capital of the U.S.

In this next section we will look at the different variables and observe how they perform overall, in order to determine any form of correlation.

Sneakers

The following graph shows the number of shoes sold for each of the different sneaker types. It is evident that there are certain shoes that are more popular than others.

sneakers %>%
  count(sneaker_name) %>%
  mutate(sneaker_name = fct_reorder(sneaker_name, n)) %>%
  ggplot(aes(sneaker_name, n)) + 
  geom_col(fill = "limegreen") + 
  labs(title = "Total Sales Per Sneaker", x = "Sneaker Name", y = "Number of Shoes") + 
  coord_flip()

This next graph shoes the profit that was made for each individual shoe. Here we can notice that the shoes with the most profit do not exactly match the most popular shoes, therefore we can conclude that some shoes run for higher prices, and are available in smaller quantities.

sneakers %>%
  group_by(sneaker_name) %>%
  summarise(avg_profit = round(mean(profit))) %>%
  ggplot(aes(reorder(sneaker_name, avg_profit), avg_profit)) +
  geom_col(fill = "limegreen") +
  scale_y_continuous(labels = scales::dollar_format()) +
  labs(title = "Average Profit Per Sneaker", x = "Sneaker", y = "Average Profit") +
  coord_flip()

Buyer Regions

The nest few graphs will look into the statistics for the buyer regions, or in other words the states. We would like to see if the location of the buy has any importance in the way that it influences the sales of the shoes.

In this first graph, we examine the number of shoes sold in each state. It is clear that California has a high number of sales, with New York following.

sneakers %>%
  group_by(buyer_region) %>%
  count() %>%
  ggplot(aes(reorder(buyer_region, n), n)) + 
  geom_col(fill = "limegreen") + 
  labs(title = "Total Number of Shoes Sold Per State", x = "State", y = "Number of Shoes") + 
  coord_flip()

This graph shows the total amount in sales that each state made. It is observed that the number of sales are correlated with the total amount of money made from sales.

sneakers %>%
  group_by(buyer_region) %>%
  summarise(total_purchased = sum(sale_price)) %>%
  ggplot(aes(reorder(buyer_region, total_purchased), total_purchased)) + 
  geom_col(fill = "limegreen") + 
  scale_y_continuous(labels = scales::dollar_format()) + 
  labs(title = "Total Sales Per State", x = "State", y = "Total in Sales (Dollars)") + 
  coord_flip()

To further examine the influence of the buyer region we look at the average sale price for a the shoes sold in each state. However, the average sale price also has to do
with the types of shoes sold in each state too.

sneakers %>%
  group_by(buyer_region) %>%
  summarise(avg_purchased = mean(sale_price)) %>% 
  ggplot(aes(reorder(buyer_region, avg_purchased), avg_purchased)) + 
  geom_col(fill = "limegreen") + 
  scale_y_continuous(labels = scales::dollar_format()) + 
  labs(title = "Average Sale Price Per State", x = "State", y = "Average Shoe Price (Dollar)") + 
  coord_flip()

Likewise, we will also look at the average shoe size that each state buys. This is relevant because some sizes are more attainable than others, therefore we can get an idea of which states perhaps buy rarer shoes.

sneakers %>%
  select(buyer_region, shoe_size) %>%
  group_by(buyer_region) %>%
  summarise(avg_shoe_size = mean(shoe_size)) %>%
  ggplot(aes(reorder(buyer_region, avg_shoe_size), avg_shoe_size)) +
  geom_point(color = "limegreen") + 
  labs(title = "Average Shoe Size by State", x = "State", y = "Average Shoe Size") +
  coord_flip()

Shoe Size

After looking at shoe size in states, we will examine the influence that shoe size has on the sales.

In the graph, we notice that there is a normal distribution for the sizes of the shoes which is expected.

sneakers %>%
  group_by(shoe_size) %>%
  count() %>%
  ggplot(aes(shoe_size, n)) + 
  geom_point(color = "limegreen", size = 3) + 
  scale_x_continuous(breaks = 3:17) + 
  labs(title = "Total Number of Sales Per Shoe Size", x = "Shoe Size", y = "Number of Sales")

The following graph shows the shoe size with the profit percent that was made on each sale. The scatterplot is utilized to be able to identify shoes that are outliers that contribute to a higher percent profit.

sneakers %>%
  select(shoe_size, profit_perc) %>%
  group_by(shoe_size) %>%
  ggplot(aes(x = shoe_size, y = profit_perc)) +
  geom_point(color = "limegreen") + 
  scale_x_continuous(breaks = 3:17) +
  scale_y_continuous(labels = function(x) paste0(x, "%")) +
  labs(title = "Total Profit Percent Per Shoe Size", x = "Shoe Size", y = "Percent of Profit")

After looking at the percent profit, it is critical to look over the average sale price that each shoe size sells for. Because larger shoe sizes are harder to acquire it is reasonable to note that they sell for higher prices on average as shown in the graph.

sneakers %>% 
  select(shoe_size, sale_price) %>%
  group_by(shoe_size) %>%
  summarise(avg_sale_price = round(mean(sale_price))) %>%
  ggplot(aes(x = shoe_size, y = avg_sale_price)) + 
  geom_col(fill = "limegreen") +
  scale_x_continuous(breaks = 3:17) + 
  scale_y_continuous(labels = scales::dollar_format()) + 
  labs(title = "Average Sale Price Per Shoe Size", x = "Shoe Size", y ="Average Sale Price")

Weeks

Lastly, we will analyze how the time of an order compared to the release date of each shoe can affect the sales of each shoe.

The first graph that we used was to see how the percent profit of the two brands across the weeks of their sale. It is also important to note that Off-White sales do not go past 75 weeks, while Yeezy sales do. Also it is seen that Off-White shoes tend to make a higher percentage of profit in comparison to Yeezy.

sneakers %>%
  ggplot(aes(x = as.numeric(weeks), y = profit_perc, color = factor(brand))) +
  geom_point() +
  scale_colour_manual(values = c("#7CFC00", "#228B22")) +
  scale_y_continuous(labels = function(x) paste0(x, "%")) + 
  labs(title = "Weeks when Shoes Sold for Most Profit", x = "Weeks", y = "Percent of Profit")

Another graph that was made was the following to see what weeks made the most sales in comparison to their release dates. Here it is clear that many sales were made close to the release date.

sneakers %>%
  group_by(weeks) %>%
  count() %>%
  ggplot(aes(as.numeric(weeks), n)) + 
  geom_point(color = "limegreen", size = 3) +
  labs(title = "Weeks when the Most Shoes Were Sold", x = "Weeks", y = "Number of Shoes")

Out of curiosity, a graph displaying which day of the week sold the most shoes was made. Based on the graph, Friday seems to be the most popular day to purchase shoes.

sneakers %>%
  group_by(day_of_week) %>%
  count() %>%
  ggplot(aes(n, day_of_week)) +
  geom_col(fill = "limegreen") +
  labs(title = "Number of Shoes Sold Per Weekday", x = "Weekday", y = "Number of Shoes") + 
  scale_y_discrete(limits = c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday")) + 
  coord_flip()

Although Friday is the day that sells the most shoes, Wednesday is the day that makes the most profit, as seen in the graph.

sneakers %>%
  group_by(day_of_week) %>%
  summarise(avg_profit_perc = round(mean(profit_perc))) %>% 
  ggplot(aes(reorder(x = day_of_week, avg_profit_perc), y = avg_profit_perc)) + 
  geom_col(fill = "limegreen") +
  labs(title = "Average Profit Percent per Weekday", x = "Weekday", y = "Average Percent of Profit") +
  scale_x_discrete(limits = c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday")) + 
  coord_flip()

Data Splitting

After exploring our data and understanding the correlation that the variables have with each other, it is time to split the data. However before we do that it is important to convert certain variables into factors so that we can later use them in our recipe.

#Convert variables to factors
sneakers$brand <- as.factor(sneakers$brand)
sneakers$sneaker_name <- as.factor(sneakers$sneaker_name)
sneakers$buyer_region <- as.factor(sneakers$buyer_region)
sneakers$day_of_week <- as.factor(sneakers$day_of_week)
sneakers$weeks <- as.numeric(sneakers$weeks)

Next we split the data and stratify it. When we stratify the data we make sure the data is being split evenly and between training and testing data sets. This makes sure that each set contains the same proportions of the response variable to reflect the entire data set overall. This helps the process of modeling because it allows the models to work with a smaller version of the overall data in order to acheive higher accuracy in predictions

Cross Validation

Afterwards the split data will be assigned as training or testing. And before creating the recipe, I will fold my training data into 10 folds, with 2 repeats in order to be able to perform cross validation. Stratified cross validation will enhance the modeling process because it will further break down the training data set into smaller data sets that will be ran to create mini models, and essentially the best model will be used on the overall training data, and this too will help in determining the best model overall with the highest accuracy.

Originally, I had set up my cross validation repeats to equal 5, however, because I was working with a data set with a large amount of observations and many models took a long time to run, I decided that 2 repeats would be more appropriate.

set.seed(0714)

# Split the data 
sneakers_split <- initial_split(sneakers, strata = profit_perc, prop = 0.7)

# Assign the training and testing data
sneakers_train <- training(sneakers_split)
sneakers_test <- testing(sneakers_split)

# V-fold cross validation
sneakers_fold <- vfold_cv(sneakers_train, strata = profit_perc, v = 10, repeats = 2)

Correlation Plot

Before creating our recipe, a correlation plot will be made in order to examine which variables are positively or negatively correlated with each other. This will provide a vague idea of which variables to use in our recipe to predict our response variable. For this data set I have decided to make the profit percentage the response variable, therefore I will note any variables that have high correlation to the profit percentage. As seen in the plot, sale price and, of course, profit are extremely correlated to the profit percentage, therefore they will not be included in the recipe in order to avoid multicollinearity.

sneakers_train2 <- sneakers_train[,sapply(sneakers_train, is.numeric)]
sneakers_train2 %>%
  cor() %>%
  corrplot(type = "lower", diag = FALSE,
           method = 'color', addCoef.col = "Black")

Finally, we will create our recipe that will be predicting the profit percentage of the shoe sales, given the variables seen in the code below. Additionally, we will dummy code all the nominal predictors, as well as centering and scaling them.

sneakers_recipe <- recipe(
  # predict the profit percent using all the predictors except, profit, release_date, order_date.
  # weeks will account for the dates that are not being included
  # will use the training data set for this recipe
  profit_perc ~ brand + sneaker_name + retail_price + buyer_region + shoe_size + weeks + day_of_week, data = sneakers_train) %>%
  # dummy code all the variables
  step_dummy(all_nominal_predictors()) %>%
  # scale and center all the predictor variables
  step_normalize(all_predictors())

Model Building

Now that we have our recipe made we can begin the modeling process. Because the response variable is a continuous quantitative variable, I will be making regression models. The models that will be used for this data set are: * Linear Regression Model

  • Regularization Regression
    • Lasso Regression
    • Ridge Regression
  • Tree-Based Models
    • Decision Tree
    • Random Forest
  • KNN (Nearest Neighbors)

Linear Regression Model

Because we are dealing with a regression problem, I began with a simple linear regression model. For this model I will not be using cross validation. First I create the model and the workflow, as well as add the recipe.

# Create the model
lm_model <- linear_reg() %>%
  set_engine("lm")

# Create the workflow
lm_wf <- workflow() %>%
  add_model(lm_model) %>%
  add_recipe(sneakers_recipe)

Next I fit the training data to the model.

# Fit model to training set
lm_fit <- fit(lm_wf, sneakers_train)

Afterwards I make a table that will list the prediction values of the model along with the actual values.

# Table w/ predicted and actual values
sneakers_train_res <- predict(lm_fit, new_data = sneakers_train %>% select(-profit_perc))

sneakers_train_res <- bind_cols(sneakers_train_res, sneakers_train %>% select(profit_perc))

In order to understand the performance of this regression model, I outputted the RMSE, \(R^2\), and the MAE using the following code. The RMSE tells us how well the regression model can predict the value of the response variable in absolute terms. On the other hand, \(R^2\) tells us how well a model can predict the value of the response variable in percentage terms, or in this case in the form of a decimal. The MAE tells us how accurate our predictions are and, what the amount of deviation from the actual values is.

# Metric Set w/ RMSE, MSE, R^2
sneakers_metrics <- metric_set(rmse, rsq, mae)
sneakers_metrics(sneakers_train_res, truth = profit_perc, 
                 estimate = .pred)
## # A tibble: 3 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard      55.0  
## 2 rsq     standard       0.868
## 3 mae     standard      30.3

Finally, in order to visually see how the model performed, I graphed the predictions versus the actual responses. For this model, it seems like the model somewhat follows the direction of the line, however there are certain predictions that are either overestimated or underestimated.

# Plot for Predicted vs Actual Values
sneakers_train_res %>%
  ggplot(aes(x = .pred, y = profit_perc)) +
  geom_point(alpha = 0.2) +
  geom_abline(lty = 2) +
  theme_bw() +
  coord_obs_pred()

Regularization Regression

Next we will be fitting the training data into Ridge and Lasso regression models.

Ridge Regression

Similar to linear regression, we will create the model, and workflow. However for these models, we will be tuning them, therefore we will use our folds that we created for cross-validation.

# Create a model
ridge_spec <- 
  linear_reg(penalty = tune(), mixture = 0) %>%
  set_mode("regression") %>%
  set_engine("glmnet")

# Create a workflow 
ridge_wf <- workflow() %>%
  add_recipe(sneakers_recipe) %>%
  add_model(ridge_spec)

Next, the tuning grid will be set up that will determine the parameters for this model. The parameter tuned in this model will be penalty, while mixture will be set as 0 in order for it to be classified as a ridge regression model.

penalty_grid_r <- grid_regular(penalty(range = c(-5, 5)), levels = 50)

Finally, the model will be tuned, and cross-validation folds will be fit with the workflow as well.

tune_res_ridge <- tune_grid(
  ridge_wf, 
  resamples = sneakers_fold, 
  grid = penalty_grid_r,
  control = control_grid(verbose = TRUE)
)

Next, it is helpful to visualize the results, therefore autoplot will be used.

autoplot(tune_res_ridge)

In order to know which model performed the best I will collect the metrics of all the models and choose the best.

collect_metrics(tune_res_ridge)
## # A tibble: 100 × 7
##      penalty .metric .estimator   mean     n std_err .config              
##        <dbl> <chr>   <chr>       <dbl> <int>   <dbl> <chr>                
##  1 0.00001   rmse    standard   56.5      20 0.455   Preprocessor1_Model01
##  2 0.00001   rsq     standard    0.862    20 0.00159 Preprocessor1_Model01
##  3 0.0000160 rmse    standard   56.5      20 0.455   Preprocessor1_Model02
##  4 0.0000160 rsq     standard    0.862    20 0.00159 Preprocessor1_Model02
##  5 0.0000256 rmse    standard   56.5      20 0.455   Preprocessor1_Model03
##  6 0.0000256 rsq     standard    0.862    20 0.00159 Preprocessor1_Model03
##  7 0.0000409 rmse    standard   56.5      20 0.455   Preprocessor1_Model04
##  8 0.0000409 rsq     standard    0.862    20 0.00159 Preprocessor1_Model04
##  9 0.0000655 rmse    standard   56.5      20 0.455   Preprocessor1_Model05
## 10 0.0000655 rsq     standard    0.862    20 0.00159 Preprocessor1_Model05
## # … with 90 more rows
best_penalty_r <- select_best(tune_res_ridge, metric = "rsq")

Finally, I will use the best model to fit the overall training data into.

ridge_final <- finalize_workflow(ridge_wf, best_penalty_r)

ridge_final_fit <- fit(ridge_final, data = sneakers_train)

We will evaluate the performance that the model has on the training data. This evaluation will be used later on to compare the performance between all of the models.

augment(ridge_final_fit, new_data = sneakers_train) %>%
  sneakers_metrics(truth = profit_perc, estimate = .pred)
## # A tibble: 3 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard      56.4  
## 2 rsq     standard       0.863
## 3 mae     standard      31.4

Lasso Regression

This process will be similar to the same process as Ridge Regression.However in this model mixture will be set to 1, and the same parameter will be tuned, which is penalty. That is what classifies this model as a lasso regression model.

# Create the model
lasso_spec <- 
  linear_reg(penalty = tune(), mixture = 1) %>%
  set_mode("regression") %>%
  set_engine("glmnet")

# Create the workflow 
lasso_wf <- workflow() %>%
  add_recipe(sneakers_recipe) %>%
  add_model(lasso_spec)

# Create the tuning grid
penalty_grid_l <- grid_regular(penalty(range = c(-2, 2)), levels = 50)

tune_res_lasso <- tune_grid(
  lasso_wf, 
  resamples = sneakers_fold, 
  grid = penalty_grid_l,
  control = control_grid(verbose = TRUE)
)

# Visualize results
autoplot(tune_res_lasso)

# See performance of all models 
collect_metrics(tune_res_lasso)
## # A tibble: 100 × 7
##    penalty .metric .estimator   mean     n std_err .config              
##      <dbl> <chr>   <chr>       <dbl> <int>   <dbl> <chr>                
##  1  0.01   rmse    standard   55.3      20 0.439   Preprocessor1_Model01
##  2  0.01   rsq     standard    0.867    20 0.00164 Preprocessor1_Model01
##  3  0.0121 rmse    standard   55.3      20 0.439   Preprocessor1_Model02
##  4  0.0121 rsq     standard    0.867    20 0.00164 Preprocessor1_Model02
##  5  0.0146 rmse    standard   55.3      20 0.439   Preprocessor1_Model03
##  6  0.0146 rsq     standard    0.867    20 0.00164 Preprocessor1_Model03
##  7  0.0176 rmse    standard   55.3      20 0.440   Preprocessor1_Model04
##  8  0.0176 rsq     standard    0.867    20 0.00164 Preprocessor1_Model04
##  9  0.0212 rmse    standard   55.3      20 0.440   Preprocessor1_Model05
## 10  0.0212 rsq     standard    0.867    20 0.00164 Preprocessor1_Model05
## # … with 90 more rows
# Choose best model
best_penalty_l <- select_best(tune_res_lasso, metric = "rsq")

# Fit best model to training data 
lasso_final <- finalize_workflow(lasso_wf, best_penalty_l)

lasso_final_fit <- fit(lasso_final, data = sneakers_train)

# Evaluate performance of model on training data
augment(lasso_final_fit, new_data = sneakers_train) %>%
  sneakers_metrics(truth = profit_perc, estimate = .pred)
## # A tibble: 3 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard      55.2  
## 2 rsq     standard       0.867
## 3 mae     standard      30.4

Tree-Based Models

Next tree-based models are going to be created with the folded data.

Decision Tree

For this model the parameter that will be tuned is cost_complexity. The same first few steps from the previous models produced are executed for this model too.

# Create the model
reg_tree_spec <- decision_tree() %>%
  set_engine("rpart") %>%
  set_mode("regression")

# Create the workflow
reg_tree_wf <- workflow() %>%
  add_model(reg_tree_spec %>% set_args(cost_complexity = tune())) %>%
  add_recipe(sneakers_recipe)

# Create tuning grid
param_grid_dt <- grid_regular(cost_complexity(range = c(-4, -1)), levels = 10)

tune_res_dt <- tune_grid(
  reg_tree_wf, 
  resamples = sneakers_fold, 
  grid = param_grid_dt,
  control = control_grid(verbose = TRUE)
)

By using the autoplot function, it can be shown how the value of the cost complexity parameter affects the RMSE and \(R^2\).

# Shows which value of cost complexity produces best rmse and rsq
autoplot(tune_res_dt)

Again like previously seen, the best model will be chosen and fit to the training data and the performance will be evaluated.

# See performance of all models 
collect_metrics(tune_res_dt)%>%
  arrange(std_err) %>% 
  head() 
## # A tibble: 6 × 7
##   cost_complexity .metric .estimator  mean     n  std_err .config              
##             <dbl> <chr>   <chr>      <dbl> <int>    <dbl> <chr>                
## 1        0.000464 rsq     standard   0.947    20 0.000937 Preprocessor1_Model03
## 2        0.000215 rsq     standard   0.958    20 0.000959 Preprocessor1_Model02
## 3        0.0001   rsq     standard   0.966    20 0.000965 Preprocessor1_Model01
## 4        0.00464  rsq     standard   0.862    20 0.00113  Preprocessor1_Model06
## 5        0.001    rsq     standard   0.929    20 0.00126  Preprocessor1_Model04
## 6        0.00215  rsq     standard   0.902    20 0.00143  Preprocessor1_Model05
# Choose best model 
best_complexity <-select_best(tune_res_dt, metric = "rmse")

# Fit best model to training data
reg_tree_final <- finalize_workflow(reg_tree_wf, best_complexity)

reg_tree_final_fit <- fit(reg_tree_final, data = sneakers_train)

Next a visualization of the decision tree will be produced.

# Visualization of the tree
reg_tree_final_fit %>%
  extract_fit_engine() %>%
  rpart.plot(roundint = FALSE)

# Evaluate performance 
augment(reg_tree_final_fit, new_data = sneakers_train) %>%
  sneakers_metrics(truth = profit_perc, estimate = .pred)
## # A tibble: 3 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard      26.3  
## 2 rsq     standard       0.970
## 3 mae     standard      14.8

Random Forest

We will repeat the same process for this model as was seen in the decision tree model. The difference between the random forest model and decision trees is that this model has three parameters that will be tuned: mtry, trees, and min_n. The hyperparameter mtry represents the number of predictors that will be randomly sampled at each split when creating the tree models. The hyperparameter trees represents the number of trees contained in the ensemble. The hyperparameter min_n represents the minimum number of data points in a node that are required for the node to be split further.

# Create the model 
rf_spec <- rand_forest() %>%
  set_engine("ranger", importance = "impurity") %>%
  set_mode("regression")

# Create the workflow
rf_wf <- workflow() %>%
  add_model(rf_spec %>% set_args(mtry = tune(), trees = tune(), min_n = tune())) %>%
  add_recipe(sneakers_recipe)

# Create tuning grid 
param_grid_rf <- grid_regular(mtry(range = c(1, 7)), trees(range = c(50, 100)), min_n(range = c(1, 5)), levels = 5)

tune_res_rf <- tune_grid(
  rf_wf, 
  resamples = sneakers_fold, 
  grid = param_grid_rf, 
  metrics = metric_set(rmse),
  control = control_grid(verbose = TRUE)
)

The visualization of this model shows how the number of trees perform for each min_n and how it affects the RMSE.

autoplot(tune_res_rf)

Again like previously seen, the best model will be chosen and fit to the training data and the performance will be evaluated.

# See performance of all models 
collect_metrics(tune_res_rf) %>% 
  head()
## # A tibble: 6 × 9
##    mtry trees min_n .metric .estimator  mean     n std_err .config              
##   <int> <int> <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
## 1     1    50     1 rmse    standard   126.     20   0.893 Preprocessor1_Model0…
## 2     2    50     1 rmse    standard    90.9    20   0.613 Preprocessor1_Model0…
## 3     4    50     1 rmse    standard    59.4    20   0.416 Preprocessor1_Model0…
## 4     5    50     1 rmse    standard    51.2    20   0.466 Preprocessor1_Model0…
## 5     7    50     1 rmse    standard    41.2    20   0.395 Preprocessor1_Model0…
## 6     1    62     1 rmse    standard   125.     20   0.994 Preprocessor1_Model0…
# Choose best model 
best_forest <-select_best(tune_res_rf, metric = "rmse")

# Fit best model to training data
rf_final <- finalize_workflow(rf_wf, best_forest)

rf_final_fit <- fit(rf_final, data = sneakers_train)

# Evaluate performance 
augment(rf_final_fit, new_data = sneakers_train) %>%
  sneakers_metrics(truth = profit_perc, estimate = .pred)
## # A tibble: 3 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard      38.1  
## 2 rsq     standard       0.942
## 3 mae     standard      21.2

KNN (Nearest Neighbors)

The last model was a model that I attempted to fit however I was not the most confident in running it because we did not get much practice with it in this course. For this model the parameter neighbors was set to be tuned.

# Create the model 
knn_model <- nearest_neighbor(
  neighbors = tune(),
  mode = "regression") %>%
  set_engine("kknn")

# Create the workflow 
knn_wf <- workflow() %>%
  add_model(knn_model) %>%
  add_recipe(sneakers_recipe)

# Set-up tuning grid
knn_params <- parameters(knn_model)

# Define grid
knn_grid <- grid_regular(knn_params, levels = 2)

# Tune the grid
knn_tune <- knn_wf %>%
  tune_grid(
    resamples = sneakers_fold,
    grid = knn_grid,
    control = control_grid(verbose = TRUE)
)

The autoplot function was used to show how the number of neighbors affect the RMSE.

#Visualization of the # of neighbors
autoplot(knn_tune, metric = "rmse")

Next we choose the best model, fit our training data, and evaluate the performance of the model on the training data set.

# See performance of all models 
collect_metrics(knn_tune)
## # A tibble: 4 × 7
##   neighbors .metric .estimator   mean     n std_err .config             
##       <int> <chr>   <chr>       <dbl> <int>   <dbl> <chr>               
## 1         1 rmse    standard   56.0      20 0.586   Preprocessor1_Model1
## 2         1 rsq     standard    0.866    20 0.00224 Preprocessor1_Model1
## 3        15 rmse    standard   51.7      20 0.478   Preprocessor1_Model2
## 4        15 rsq     standard    0.885    20 0.00165 Preprocessor1_Model2
# Choose best model
best_knn <- select_best(knn_tune, metric = "rsq")

# Fit best to training data 
knn_final <- finalize_workflow(knn_wf, best_knn)

knn_final_fit <- fit(knn_final, data = sneakers_train)

# Evaluate performance
augment(knn_final_fit, new_data = sneakers_train) %>%
  sneakers_metrics(truth = profit_perc, estimate = .pred)
## # A tibble: 3 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard      39.1  
## 2 rsq     standard       0.935
## 3 mae     standard      18.2

Conclusion

Model Performance

Next we will look at the performance of each model and compare to see which model performed best on the training data. For the linear regression model, it performed with the following metrics: * RMSE: 54.9713892

  • \(R^2\): 0.8682691

  • MAE: 30.2680065

For the ridge regression model, it performed with the following metrics: * RMSE: 56.4243079

  • \(R^2\): 0.8626641

  • MAE: 31.3720788

For the lasso regression model, it performed with the following metrics: * RMSE: 55.207143

  • \(R^2\): 0.867142

  • MAE: 30.440152

For the decision tree model, it performed with the following metrics: * RMSE: 26.3174108

  • \(R^2\): 0.9698074

  • MAE: 14.7527728

For the random forest mode, it performed with the following metrics: * RMSE: 38.1130937

  • \(R^2\): 0.9420715

  • MAE: 21.1902408

For the KNN model, it performed with the following metrics: * RMSE: 39.0606262

  • \(R^2\): 0.9350643

  • MAE: 18.2064622

Based on the results the models that performed, ordered in best performing to least are: * Decision Tree

  • Random Forest

  • KNN

  • Linear Regression

  • Lasso regression

  • Ridge Regression

Therefore based on these results, I will proceed to fit my testing data with the decision tree model.

Model Fitting

I will fit the testing data with the best performing model.

# Fit testing data 
rf_final_fit_test <- fit(rf_final, data = sneakers_test)

After looking at the matrix, we can see how the model actually performed by outputting the RMSE, \(R^2\), and MAE.

# Evaluate Performance for Testing Data
augment(rf_final_fit_test, new_data = sneakers_test) %>%
  sneakers_metrics(truth = profit_perc, estimate = .pred)
## # A tibble: 3 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard      39.6  
## 2 rsq     standard       0.937
## 3 mae     standard      22.2

Based on the outcome, the decision tree model performed with an \(R^2\) of 0.93, which means that 93% percent of the data was fitted accurately to the model predictions. Overall, this project has analyzed each of the variables in the data set and used them to predict the profit percentage that the company StockX will make off the sale of a shoe. For this data set, the decision tree model was the machine learning model with the best performance.